{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Table of Contents\n",
    "* [Introduction](#1)\n",
    "* [NLP five major steps](#2)\n",
    "* [Corpus](#3)\n",
    "* [Tokenize](#4)\n",
    "* [Stopwords](#5)\n",
    "* [Bag of Words](#6)\n",
    "* [Count Vectorizer](#7)\n",
    "* [TF-IDF](#8)\n",
    "* [Text Classification](#9)\n",
    "* [Evaluation](#10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2 id='1'>Introduction</h2>\n",
    "\n",
    "- NLP:\n",
    "\n",
    "&emsp;&emsp;Natural Language Processing (NLP) is the field concerned with making machines learn from and understand textual data\n",
    "in order to perform useful tasks.\n",
    "- Application areas:\n",
    "    - chatbot\n",
    "    - speech recognition\n",
    "    - translation\n",
    "    - spam detection\n",
    "    - sentiment analysis\n",
    "    - etc"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2 id='2'>NLP five major steps</h2>\n",
    "\n",
    "- Corpus\n",
    "- Tokenization\n",
    "- Cleaning / stopword removal\n",
    "- Stemming: the stem is the part of a word that remains after removing the endings that carry grammatical meaning\n",
    "    - For example, in “老师们” (“teachers”), “老师” (“teacher”) is the stem and “们” is the suffix\n",
    "- Converting into numerical form"
   ]
  },
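  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2 id='3'> Corpus </h2>\n",
    "\n",
    "&emsp;&emsp;A corpus is a collection of text documents. For example, suppose we have a collection of several thousand emails that we need to process and analyze; that set of emails is the corpus."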
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2 id='4'>Tokenize</h2>\n",
    "\n",
    "&emsp;&emsp;Tokenization splits a given sentence or collection of words in a text document into individual tokens."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "from pyspark.sql import SparkSession\n",
    "\n",
    "spark = SparkSession.builder.appName('nlp').getOrCreate()\n",
    "\n",
    "# Four toy reviews used in the tokenization examples below\n",
    "df = spark.createDataFrame(\n",
    "    [(1, 'I really liked this movie'),\n",
    "     (2, 'I would recommend this movie to my friends'),\n",
    "     (3, 'movie was alright but acting was horrible'),\n",
    "     (4, 'I am never watching that movie ever again')],\n",
    "    ['user_id', 'review'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------+------------------------------------------+\n",
      "|user_id|review                                    |\n",
      "+-------+------------------------------------------+\n",
      "|1      |I really liked this movie                 |\n",
      "|2      |I would recommend this movie to my friends|\n",
      "|3      |movie was alright but acting was horrible |\n",
      "|4      |I am never watching that movie ever again |\n",
      "+-------+------------------------------------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df.show(4, False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------+------------------------------------------+---------------------------------------------------+\n",
      "|user_id|review                                    |tokenized                                          |\n",
      "+-------+------------------------------------------+---------------------------------------------------+\n",
      "|1      |I really liked this movie                 |[i, really, liked, this, movie]                    |\n",
      "|2      |I would recommend this movie to my friends|[i, would, recommend, this, movie, to, my, friends]|\n",
      "|3      |movie was alright but acting was horrible |[movie, was, alright, but, acting, was, horrible]  |\n",
      "|4      |I am never watching that movie ever again |[i, am, never, watching, that, movie, ever, again] |\n",
      "+-------+------------------------------------------+---------------------------------------------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from pyspark.ml.feature import Tokenizer\n",
    "\n",
    "tokenization = Tokenizer(inputCol='review', outputCol='tokenized')\n",
    "tokenized_df = tokenization.transform(df)\n",
    "tokenized_df.show(4, False)"
   ]
  },
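  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2 id='5'>Stopwords Removal</h2>\n",
    "\n",
    "&emsp;&emsp;To save storage space and improve search efficiency, certain characters or words are filtered out automatically when processing natural-language data. These are called stop words. They carry little meaning of their own, for example “the”, “a”, “an”, “that”, and “those”."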
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------+---------------------------------------------------+----------------------------------+\n",
      "|user_id|tokenized                                          |new_tokenized                     |\n",
      "+-------+---------------------------------------------------+----------------------------------+\n",
      "|1      |[i, really, liked, this, movie]                    |[really, liked, movie]            |\n",
      "|2      |[i, would, recommend, this, movie, to, my, friends]|[recommend, movie, friends]       |\n",
      "|3      |[movie, was, alright, but, acting, was, horrible]  |[movie, alright, acting, horrible]|\n",
      "|4      |[i, am, never, watching, that, movie, ever, again] |[never, watching, movie, ever]    |\n",
      "+-------+---------------------------------------------------+----------------------------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from pyspark.ml.feature import StopWordsRemover\n",
    "\n",
    "sw_removal = StopWordsRemover(inputCol='tokenized', outputCol='new_tokenized')\n",
    "\n",
    "new_df  = sw_removal.transform(tokenized_df)\n",
    "\n",
    "new_df.select(['user_id', 'tokenized', 'new_tokenized']).show(4, False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Converting text data into numerical vectors\n",
    "\n",
    "- Bag of Words\n",
    "- Count Vectorizer\n",
    "- TF-IDF"
   ]
  },
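  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2 id='6'>Bag of Words</h2>\n",
    "\n",
    "&emsp;&emsp;Bag of Words (BOW) ignores the order, meaning, and number of occurrences of words in a document; it records only whether each word appears. It is one of the simplest numerical representations.\n",
    "\n",
    "For example:\n",
    "- doc 1: The best thing in life is to travel\n",
    "- doc 2: Travel is the best medicine\n",
    "- doc 3: One should travel more often\n",
    "\n",
    "Vocabulary: the list of all distinct words that occur across the documents.\n",
    "\n",
    "Docs 1, 2, and 3 together contain 13 distinct words, which form the vocabulary below:\n",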
    "\n",
    "the    best    thing    in    life    is    to    travel    medicine    one    should    more    often\n",
    "\n",
    "- doc 1 vector: 1 1 1 1 1 1 1 1 0 0 0 0 0\n",
    "- doc 2 vector: 1 1 0 0 0 1 0 1 1 0 0 0 0\n",
    "- doc 3 vector: 0 0 0 0 0 0 0 1 0 1 1 1 1"
   ]
  },
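  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The binary vectors above can be reproduced with a few lines of plain Python. This is a minimal sketch and not part of the Spark pipeline: the helper names `build_vocabulary` and `bow_vector` are hypothetical, and it assumes that lowercasing plus whitespace splitting is an adequate tokenizer for these three sentences.\n",
    "\n",
    "```python\n",
    "docs = [\n",
    "    'The best thing in life is to travel',\n",
    "    'Travel is the best medicine',\n",
    "    'One should travel more often',\n",
    "]\n",
    "\n",
    "\n",
    "def build_vocabulary(documents):\n",
    "    # Collect distinct lowercase words in order of first appearance.\n",
    "    vocabulary = []\n",
    "    for doc in documents:\n",
    "        for word in doc.lower().split():\n",
    "            if word not in vocabulary:\n",
    "                vocabulary.append(word)\n",
    "    return vocabulary\n",
    "\n",
    "\n",
    "def bow_vector(doc, vocabulary):\n",
    "    # 1 if the word occurs in the document, 0 otherwise (counts are ignored).\n",
    "    words = set(doc.lower().split())\n",
    "    return [1 if word in words else 0 for word in vocabulary]\n",
    "\n",
    "\n",
    "vocabulary = build_vocabulary(docs)\n",
    "vectors = [bow_vector(doc, vocabulary) for doc in docs]\n",
    "for vector in vectors:\n",
    "    print(vector)\n",
    "```\n",
    "\n",
    "Because the vocabulary is built in order of first appearance, its column order matches the word list shown above, so each printed row can be compared directly with the corresponding doc vector."
   ]
  },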
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2 id='7'>Count Vectorizer</h2>\n",
    "\n",
    "&emsp;&emsp;Count Vectorizer is very similar to BOW: it also ignores word order and meaning, but it counts how many times each word occurs in a document."
   ]
  },
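  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The difference between a binary vector and a count vector only shows up when a word repeats. The sketch below uses the tokens of review 3 before stop-word removal, where `was` appears twice. It is plain Python for illustration only, and the alphabetical vocabulary order is an assumption made here for readability; Spark's `CountVectorizer` builds its own vocabulary from the whole corpus.\n",
    "\n",
    "```python\n",
    "from collections import Counter\n",
    "\n",
    "tokens = ['movie', 'was', 'alright', 'but', 'acting', 'was', 'horrible']\n",
    "\n",
    "# Illustrative vocabulary: distinct tokens in alphabetical order.\n",
    "vocabulary = sorted(set(tokens))\n",
    "counts = Counter(tokens)\n",
    "\n",
    "# Count vector: number of occurrences of each vocabulary word.\n",
    "count_vector = [counts[word] for word in vocabulary]\n",
    "# Binary (BOW) vector: presence or absence only.\n",
    "binary_vector = [1 if counts[word] > 0 else 0 for word in vocabulary]\n",
    "\n",
    "print(vocabulary)\n",
    "print(count_vector)\n",
    "print(binary_vector)\n",
    "```\n",
    "\n",
    "The two vectors differ only in the position for `was`, which is the extra information a count vector keeps."
   ]
  },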
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------+--------------------+--------------------+--------------------+\n",
      "|user_id|              review|           tokenized|       new_tokenized|\n",
      "+-------+--------------------+--------------------+--------------------+\n",
      "|      1|I really liked th...|[i, really, liked...|[really, liked, m...|\n",
      "|      2|I would recommend...|[i, would, recomm...|[recommend, movie...|\n",
      "|      3|movie was alright...|[movie, was, alri...|[movie, alright, ...|\n",
      "|      4|I am never watchi...|[i, am, never, wa...|[never, watching,...|\n",
      "+-------+--------------------+--------------------+--------------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "new_df.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------+---------------------------------+----------------------------------+\n",
      "|user_id|count_vector                     |new_tokenized                     |\n",
      "+-------+---------------------------------+----------------------------------+\n",
      "|1      |(11,[0,1,4],[1.0,1.0,1.0])       |[really, liked, movie]            |\n",
      "|2      |(11,[0,3,5],[1.0,1.0,1.0])       |[recommend, movie, friends]       |\n",
      "|3      |(11,[0,2,6,9],[1.0,1.0,1.0,1.0]) |[movie, alright, acting, horrible]|\n",
      "|4      |(11,[0,7,8,10],[1.0,1.0,1.0,1.0])|[never, watching, movie, ever]    |\n",
      "+-------+---------------------------------+----------------------------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from pyspark.ml.feature import CountVectorizer\n",
    "\n",
    "count_vectorizer = CountVectorizer(inputCol='new_tokenized', outputCol='count_vector')\n",
    "\n",
    "cv_df = count_vectorizer.fit(new_df).transform(new_df)\n",
    "\n",
    "cv_df.select(['user_id', 'count_vector', 'new_tokenized']).show(4, False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Vocabulary"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['movie',\n",
       " 'liked',\n",
       " 'alright',\n",
       " 'recommend',\n",
       " 'friends',\n",
       " 'never',\n",
       " 'acting',\n",
       " 'horrible',\n",
       " 'really',\n",
       " 'ever',\n",
       " 'watching']"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "count_vectorizer.fit(new_df).vocabulary"
   ]
  },
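  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2 id='8'>TF-IDF</h2>\n",
    "\n",
    "&emsp;&emsp;This method extends the count vector by normalizing term frequencies. The idea is that a word occurring many times in the same document deserves more weight, but if it also appears in most other documents, it is common across the whole corpus and its weight should be reduced.\n",
    "\n",
    "### TF : Term Frequency\n",
    "\n",
    "- How often a word occurs in a document\n",
    "\n",
    "### IDF : Inverse Document Frequency\n",
    "\n",
    "- Based on the fraction of documents that contain the word, inverted so that words appearing in fewer documents receive a higher weight"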
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------+----------------------------------+-------------------------------------------------------+\n",
      "|user_id|new_tokenized                     |tf_vector                                              |\n",
      "+-------+----------------------------------+-------------------------------------------------------+\n",
      "|1      |[really, liked, movie]            |(262144,[14,32675,155321],[1.0,1.0,1.0])               |\n",
      "|2      |[recommend, movie, friends]       |(262144,[129613,155321,222394],[1.0,1.0,1.0])          |\n",
      "|3      |[movie, alright, acting, horrible]|(262144,[80824,155321,236263,240286],[1.0,1.0,1.0,1.0])|\n",
      "|4      |[never, watching, movie, ever]    |(262144,[63139,155321,203802,245806],[1.0,1.0,1.0,1.0])|\n",
      "+-------+----------------------------------+-------------------------------------------------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from pyspark.ml.feature import HashingTF, IDF\n",
    "\n",
    "hashing_vector = HashingTF(inputCol='new_tokenized', outputCol='tf_vector')\n",
    "\n",
    "hashing_df = hashing_vector.transform(new_df)\n",
    "\n",
    "hashing_df.select(['user_id', 'new_tokenized', 'tf_vector']).show(4, False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Computing TF-IDF"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------+----------------------------------------------------------------------------------------------------+\n",
      "|user_id|tf_idf_vector                                                                                       |\n",
      "+-------+----------------------------------------------------------------------------------------------------+\n",
      "|1      |(262144,[14,32675,155321],[0.9162907318741551,0.9162907318741551,0.0])                              |\n",
      "|2      |(262144,[129613,155321,222394],[0.9162907318741551,0.0,0.9162907318741551])                         |\n",
      "|3      |(262144,[80824,155321,236263,240286],[0.9162907318741551,0.0,0.9162907318741551,0.9162907318741551])|\n",
      "|4      |(262144,[63139,155321,203802,245806],[0.9162907318741551,0.0,0.9162907318741551,0.9162907318741551])|\n",
      "+-------+----------------------------------------------------------------------------------------------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "tf_idf_vector = IDF(inputCol='tf_vector', outputCol='tf_idf_vector')\n",
    "\n",
    "tf_idf_df = tf_idf_vector.fit(hashing_df).transform(hashing_df)\n",
    "\n",
    "tf_idf_df.select(['user_id','tf_idf_vector']).show(4, False)"
   ]
  },
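  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The weights in `tf_idf_vector` can be checked by hand. A sketch of the arithmetic, assuming the smoothed formula given in the Spark MLlib feature-extraction guide, where $|D|$ is the number of documents and $\\mathrm{DF}(t, D)$ is the number of documents containing term $t$:\n",
    "\n",
    "$$\\mathrm{IDF}(t, D) = \\log\\frac{|D| + 1}{\\mathrm{DF}(t, D) + 1}, \\qquad \\mathrm{TFIDF}(t, d, D) = \\mathrm{TF}(t, d) \\cdot \\mathrm{IDF}(t, D)$$\n",
    "\n",
    "With $|D| = 4$ reviews, “movie” (index 155321) appears in all four, so its weight is $\\log(5/5) = 0$. Every other term appears in exactly one review with $\\mathrm{TF} = 1$, giving $\\log(5/2) \\approx 0.9163$, which agrees with the values printed above."
   ]
  },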
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2 id='9'> Text Classification</h2>\n",
    "\n",
    "- Dataset: Movie Lens reviews data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "root\n",
      " |-- Review: string (nullable = true)\n",
      " |-- Sentiment: string (nullable = true)\n",
      "\n"
     ]
    }
   ],
   "source": [
    "text_df = spark.read.csv('./Data/Movie_reviews.csv', inferSchema=True, header=True, sep=',')\n",
    "\n",
    "text_df.printSchema()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "7087"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text_df.count()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "6990"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Keep only the labeled rows (Sentiment is '1' or '0') as train_df\n",
    "\n",
    "train_df = text_df.filter((text_df.Sentiment == '1') | (text_df.Sentiment == '0'))\n",
    "\n",
    "train_df.count()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Text(0.5, 1.0, 'Label Count')"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYAAAAEFCAYAAADqujDUAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvOIA7rQAAF0VJREFUeJzt3X+QXWV9x/H3h00gAqn5tY1JNnFTCYXQQnC2AaHWkChZsBqcguV3TKlrZ8KgA1WDTgdFMoNVgUKRmdgEglJiAJUVU2mEMA4ikI3EyBIpKwSzMZIlCQhNCfnx7R/32XBddrP37t7dm+T5vGbu7Dnf85xznpPZ3M+e34oIzMwsP4dVuwNmZlYdDgAzs0w5AMzMMuUAMDPLlAPAzCxTDgAzs0w5ACwLkh6R9I+DPa/ZgcwBYAcVSRskfbDa/Sgm6VhJ90h6WdKrktZJulJSzQCv9w5J1w3kOuzQ5gAw6wdJ7wGeADYCfxkR7wTOAxqA4dXsm1lvHAB2SJA0UtIDkjokbU/DdV2avUfSk5L+IOl+SaOK5j9V0mOSXpH0S0kzSlz1l4HHIuLKiNgMEBHPRsSFEfFKWvZHJbWmZT8i6fii9YakY4rG9/1VL2mGpHZJV0naImmzpHlpWhNwEfA5Sa9L+mH5/2qWOweAHSoOA24H3g1MAv4P+PcubS4F/gEYB+wGbgaQNAH4EXAdMAr4Z+A+SbUlrPeDwL09TZR0LHA38BmgFlgB/FDS4SVu17uAdwITgMuAWyWNjIhFwF3Av0bE0RHxkRKXZ7aPA8AOCRGxNSLui4gdEfEasBD4QJdm346IpyPif4F/AT6ejtNfDKyIiBURsTciVgItwNklrHo0sHk/0/8e+FFErIyIXcDXgXcAp5W4abuAayNiV0SsAF4H/rzEec32a0i1O2BWCZKOBG4EGoGRqTxcUk1E7EnjG4tmeREYCoyhsNdwnqTiv6KHAqtKWPVWCnsUPRmf1gVAROyVtJHCX/Sl2BoRu4vGdwBHlziv2X55D8AOFVdR+Mv4lIj4E+BvUl1FbSYWDU+i8Nf1yxSC4dsRMaLoc1REXF/Cen8C/N1+pv+OQsAUOiMp9WNTKu0Ajixq/64S1tnJj/K1fnEA2MFoqKRhRZ8hFK64+T/glXRy95pu5rtY0tS0t3AtcG/aO/gO8BFJsyXVpGXO6OYkcneuAU6T9DVJ7wKQdIyk70gaASwHPixplqShFIJqJ/BYmn8tcGFabyNvP2y1Py8Bf1ZGe7M/4gCwg9EKCl/2nZ8vATdROLb+MvA48ONu5vs2cAfwe2AYcAVARGwE5gBfADoo7BF8lhL+f0TEb4D3AfVAq6RXgfsonEN4LSKepXCO4ZbUt48AH4mIN9MiPp1qr1C4qucHJf4bACwGpqari8qZzwwA+YUwZmZ58h6AmVmmHABmZplyAJiZZcoBYGaWKQeAmVmmDug7gceMGRP19fXV7oaZ2UFlzZo1L0dEr8+yOqADoL6+npaWlmp3w8zsoCLpxd5blXEIKN2p+JSkB9L4ZElPSGqT9N3OpxtKOiKNt6Xp9UXLuDrVn5U0u7xNMjOzSirnHMCngfVF418FboyIY4DtFB5VS/q5PdVvTO2QNBU4HziBwgO7vjnQb0wyM7OelRQA6ZkoHwb+I40LmMlbz0FfCpyThuekcdL0Wan9HGBZROyMiBeANmB6JTbCzMzKV+o5gJuAz/HWK+5GA68UPaa2nbcebzuB9NjdiNidno0yOtUfL1pm8Twl27VrF+3t7bzxxhvlznpIGDZsGHV1dQwdOrTaXTGzg1yvASDpb4EtEbGmjNfk9Vl61V0TwKRJk942vb29neHDh1NfX09hxyIfEcHWrVtpb29n8uTJ1e6OmR3kSjkEdDrwUUkbgGUUDv38GzAiPYYXoI63nm++ifTc9TT9nRRemrGv3s08+0TEoohoiIiG2tq3X8X0xhtv
MHr06Oy+/AEkMXr06Gz3fsysskp53O3VEVEXEfUUTuI+HBEXUXhb0rmp2Vzg/jTcnMZJ0x+OwiNHm4Hz01VCk4EpwJN96XSOX/6dct52M6us/twJ/HngSkltFI7xL071xcDoVL8SWAAQEa0UXo7xDIVntc8velWfFbnpppvYsWNHtbthZoe4sm4Ei4hHgEfS8PN0cxVPRLwBnNfD/AspvKy7YuoX/KiSi2PD9R+u6PL64qabbuLiiy/myCOP7L2xWS8q/X8kZwfC90Ml+VlAfXTnnXdy4oknctJJJ3HJJZewYcMGZs6cyYknnsisWbP47W9/C8AnPvEJ7r333n3zHX104X3ejzzyCDNmzODcc8/luOOO46KLLiIiuPnmm/nd737HGWecwRlnnFGVbTOzPBzQj4I4ULW2tnLdddfx2GOPMWbMGLZt28bcuXP3fZYsWcIVV1zBD36w/7f0PfXUU7S2tjJ+/HhOP/10fvazn3HFFVdwww03sGrVKsaMGTNIW2RmOfIeQB88/PDDnHfeefu+oEeNGsXPf/5zLrzwQgAuueQSHn300V6XM336dOrq6jjssMOYNm0aGzZsGMhum5n9EQfAABsyZAh79+4FYO/evbz55pv7ph1xxBH7hmtqati9e/fb5jczGygOgD6YOXMm99xzD1u3bgVg27ZtnHbaaSxbtgyAu+66i/e///1A4Ymma9asAaC5uZldu3b1uvzhw4fz2muvDVDvzcwKfA6gD0444QS++MUv8oEPfICamhpOPvlkbrnlFubNm8fXvvY1amtruf322wH45Cc/yZw5czjppJNobGzkqKOO6nX5TU1NNDY2Mn78eFatWjXQm2NmmVLhHq0DU0NDQ3R9H8D69es5/vjjq9SjA4P/Dawcvgy0cg6Wy0AlrYmIht7a+RCQmVmmHABmZplyAJiZZeqgDIAD+bzFQMt5282ssg66ABg2bBhbt27N8ouw830Aw4YNq3ZXzOwQcNBdBlpXV0d7ezsdHR3V7kpVdL4RzMysvw66ABg6dKjfhmVmVgEH3SEgMzOrDAeAmVmmHABmZpnqNQAkDZP0pKRfSmqV9OVUv0PSC5LWps+0VJekmyW1SVon6b1Fy5or6bn0mdvTOs3MbOCVchJ4JzAzIl6XNBR4VNJ/pWmfjYh7u7Q/i8IL36cApwC3AadIGgVcAzQAAayR1BwR2yuxIWZmVp5e9wCi4PU0OjR99ncR/hzgzjTf48AISeOA2cDKiNiWvvRXAo39676ZmfVVSecAJNVIWgtsofAl/kSatDAd5rlRUufbTSYAG4tmb0+1nupmZlYFJQVAROyJiGlAHTBd0l8AVwPHAX8FjAI+X4kOSWqS1CKpJdebvczMBkNZVwFFxCvAKqAxIjanwzw7gduB6anZJmBi0Wx1qdZTves6FkVEQ0Q01NbWltM9MzMrQylXAdVKGpGG3wF8CPh1Oq6PJAHnAE+nWZqBS9PVQKcCr0bEZuBB4ExJIyWNBM5MNTMzq4JSrgIaByyVVEMhMJZHxAOSHpZUCwhYC/xTar8COBtoA3YA8wAiYpukrwCrU7trI2Jb5TbFzMzK0WsARMQ64ORu6jN7aB/A/B6mLQGWlNlHMzMbAL4T2MwsUw4AM7NMOQDMzDLlADAzy5QDwMwsUw4AM7NMOQDMzDLlADAzy5QDwMwsUw4AM7NMOQDMzDLlADAzy5QDwMwsUw4AM7NMOQDMzDLlADAzy5QDwMwsU6W8EtJ6Ub/gR9XuwiFlw/UfrnYXzLJQykvhh0l6UtIvJbVK+nKqT5b0hKQ2Sd+VdHiqH5HG29L0+qJlXZ3qz0qaPVAbZWZmvSvlENBOYGZEnARMAxolnQp8FbgxIo4BtgOXpfaXAdtT/cbUDklTgfOBE4BG4JvpRfNmZlYFvQZAFLyeRoemTwAzgXtTfSlwThqek8ZJ02dJUqovi4idEfEC0AZMr8hWmJlZ2Uo6CSypRtJaYAuwEvgN8EpE7E5N2oEJ
aXgCsBEgTX8VGF1c72ae4nU1SWqR1NLR0VH+FpmZWUlKCoCI2BMR04A6Cn+1HzdQHYqIRRHREBENtbW1A7UaM7PslXUZaES8AqwC3geMkNR5FVEdsCkNbwImAqTp7wS2Fte7mcfMzAZZKVcB1UoakYbfAXwIWE8hCM5NzeYC96fh5jROmv5wRESqn5+uEpoMTAGerNSGmJlZeUq5D2AcsDRdsXMYsDwiHpD0DLBM0nXAU8Di1H4x8G1JbcA2Clf+EBGtkpYDzwC7gfkRsaeym2NmZqXqNQAiYh1wcjf15+nmKp6IeAM4r4dlLQQWlt9NMzOrND8KwswsUw4AM7NMOQDMzDLlADAzy5QDwMwsUw4AM7NMOQDMzDLlADAzy5QDwMwsUw4AM7NMOQDMzDLlADAzy5QDwMwsUw4AM7NMOQDMzDLlADAzy5QDwMwsU6W8E3iipFWSnpHUKunTqf4lSZskrU2fs4vmuVpSm6RnJc0uqjemWpukBQOzSWZmVopS3gm8G7gqIn4haTiwRtLKNO3GiPh6cWNJUym8B/gEYDzwE0nHpsm3UnipfDuwWlJzRDxTiQ0xM7PylPJO4M3A5jT8mqT1wIT9zDIHWBYRO4EX0svhO98d3JbeJYykZamtA8DMrArKOgcgqZ7CC+KfSKXLJa2TtETSyFSbAGwsmq091Xqqm5lZFZQcAJKOBu4DPhMRfwBuA94DTKOwh/CNSnRIUpOkFkktHR0dlVikmZl1o6QAkDSUwpf/XRHxPYCIeCki9kTEXuBbvHWYZxMwsWj2ulTrqf5HImJRRDRERENtbW2522NmZiUq5SogAYuB9RFxQ1F9XFGzjwFPp+Fm4HxJR0iaDEwBngRWA1MkTZZ0OIUTxc2V2QwzMytXKVcBnQ5cAvxK0tpU+wJwgaRpQAAbgE8BRESrpOUUTu7uBuZHxB4ASZcDDwI1wJKIaK3gtpiZWRlKuQroUUDdTFqxn3kWAgu7qa/Y33xmZjZ4fCewmVmmHABmZplyAJiZZcoBYGaWKQeAmVmmHABmZplyAJiZZcoBYGaWKQeAmVmmHABmZplyAJiZZcoBYGaWKQeAmVmmHABmZplyAJiZZcoBYGaWKQeAmVmmHABmZpkq5aXwEyWtkvSMpFZJn071UZJWSnou/RyZ6pJ0s6Q2SeskvbdoWXNT++ckzR24zTIzs96UsgewG7gqIqYCpwLzJU0FFgAPRcQU4KE0DnAWMCV9moDboBAYwDXAKcB04JrO0DAzs8HXawBExOaI+EUafg1YD0wA5gBLU7OlwDlpeA5wZxQ8DoyQNA6YDayMiG0RsR1YCTRWdGvMzKxkZZ0DkFQPnAw8AYyNiM1p0u+BsWl4ArCxaLb2VOup3nUdTZJaJLV0dHSU0z0zMytDyQEg6WjgPuAzEfGH4mkREUBUokMRsSgiGiKioba2thKLNDOzbpQUAJKGUvjyvysivpfKL6VDO6SfW1J9EzCxaPa6VOupbmZmVVDKVUACFgPrI+KGoknNQOeVPHOB+4vql6argU4FXk2Hih4EzpQ0Mp38PTPVzMysCoaU0OZ04BLgV5LWptoXgOuB5ZIuA14EPp6mrQDOBtqAHcA8gIjYJukrwOrU7tqI2FaRrTAzs7L1GgAR8SigHibP6qZ9APN7WNYSYEk5HTQzs4HhO4HNzDLlADAzy5QDwMwsUw4AM7NMOQDMzDLlADAzy5QDwMwsUw4AM7NMOQDMzDLlADAzy5QDwMwsUw4AM7NMOQDMzDLlADAzy5QDwMwsUw4AM7NMOQDMzDJVyjuBl0jaIunpotqXJG2StDZ9zi6adrWkNknPSppdVG9MtTZJCyq/KWZmVo5S9gDuABq7qd8YEdPSZwWApKnA+cAJaZ5vSqqRVAPcCpwFTAUuSG3NzKxKSnkn8E8l1Ze4vDnAsojYCbwgqQ2Ynqa1RcTzAJKWpbbPlN1jMzOriP6cA7hc0rp0iGhkqk0ANha1aU+1nupmZlYlfQ2A
24D3ANOAzcA3KtUhSU2SWiS1dHR0VGqxZmbWRZ8CICJeiog9EbEX+BZvHebZBEwsalqXaj3Vu1v2oohoiIiG2travnTPzMxK0KcAkDSuaPRjQOcVQs3A+ZKOkDQZmAI8CawGpkiaLOlwCieKm/vebTMz669eTwJLuhuYAYyR1A5cA8yQNA0IYAPwKYCIaJW0nMLJ3d3A/IjYk5ZzOfAgUAMsiYjWim+NmZmVrJSrgC7oprx4P+0XAgu7qa8AVpTVOzMzGzC+E9jMLFMOADOzTDkAzMwy5QAwM8uUA8DMLFMOADOzTDkAzMwy5QAwM8uUA8DMLFMOADOzTDkAzMwy5QAwM8uUA8DMLFMOADOzTDkAzMwy5QAwM8uUA8DMLFMOADOzTPUaAJKWSNoi6emi2ihJKyU9l36OTHVJullSm6R1kt5bNM/c1P45SXMHZnPMzKxUpewB3AE0dqktAB6KiCnAQ2kc4CxgSvo0AbdBITAovEz+FGA6cE1naJiZWXX0GgAR8VNgW5fyHGBpGl4KnFNUvzMKHgdGSBoHzAZWRsS2iNgOrOTtoWJmZoOor+cAxkbE5jT8e2BsGp4AbCxq155qPdXfRlKTpBZJLR0dHX3snpmZ9abfJ4EjIoCoQF86l7coIhoioqG2trZSizUzsy76GgAvpUM7pJ9bUn0TMLGoXV2q9VQ3M7Mq6WsANAOdV/LMBe4vql+argY6FXg1HSp6EDhT0sh08vfMVDMzsyoZ0lsDSXcDM4AxktopXM1zPbBc0mXAi8DHU/MVwNlAG7ADmAcQEdskfQVYndpdGxFdTyybmdkg6jUAIuKCHibN6qZtAPN7WM4SYElZvTMzswHjO4HNzDLlADAzy5QDwMwsUw4AM7NMOQDMzDLlADAzy5QDwMwsUw4AM7NMOQDMzDLlADAzy5QDwMwsUw4AM7NMOQDMzDLlADAzy5QDwMwsUw4AM7NMOQDMzDLVrwCQtEHSryStldSSaqMkrZT0XPo5MtUl6WZJbZLWSXpvJTbAzMz6phJ7AGdExLSIaEjjC4CHImIK8FAaBzgLmJI+TcBtFVi3mZn10UAcApoDLE3DS4Fziup3RsHjwAhJ4wZg/WZmVoL+BkAA/y1pjaSmVBsbEZvT8O+BsWl4ArCxaN72VDMzsyoY0s/5/zoiNkn6U2ClpF8XT4yIkBTlLDAFSRPApEmT+tk9MzPrSb/2ACJiU/q5Bfg+MB14qfPQTvq5JTXfBEwsmr0u1bouc1FENEREQ21tbX+6Z2Zm+9HnAJB0lKThncPAmcDTQDMwNzWbC9yfhpuBS9PVQKcCrxYdKjIzs0HWn0NAY4HvS+pczn9GxI8lrQaWS7oMeBH4eGq/AjgbaAN2APP6sW4zM+unPgdARDwPnNRNfSswq5t6APP7uj4zM6ss3wlsZpYpB4CZWaYcAGZmmXIAmJllygFgZpYpB4CZWaYcAGZmmXIAmJllygFgZpYpB4CZWaYcAGZmmXIAmJllygFgZpYpB4CZWaYcAGZmmXIAmJllygFgZpYpB4CZWaYGPQAkNUp6VlKbpAWDvX4zMysY1ACQVAPcCpwFTAUukDR1MPtgZmYFg70HMB1oi4jnI+JNYBkwZ5D7YGZmwJBBXt8EYGPReDtwSnEDSU1AUxp9XdKzg9S3HIwBXq52J3qjr1a7B1YlB/zv50H0u/nuUhoNdgD0KiIWAYuq3Y9DkaSWiGiodj/MuuPfz8E32IeANgETi8brUs3MzAbZYAfAamCKpMmSDgfOB5oHuQ9mZsYgHwKKiN2SLgceBGqAJRHROph9yJwPrdmBzL+fg0wRUe0+mJlZFfhOYDOzTDkAzMwy5QAwM8vUAXcfgFWOpOMo3Gk9IZU2Ac0Rsb56vTKzA4X3AA5Rkj5P4VEbAp5MHwF3+yF8diCTNK/afciFrwI6REn6H+CEiNjVpX440BoRU6rTM7P9k/TbiJhU7X7kwIeADl17gfHAi13q49I0s6qRtK6nScDY
wexLzhwAh67PAA9Jeo63HsA3CTgGuLxqvTIrGAvMBrZ3qQt4bPC7kycHwCEqIn4s6VgKj+AuPgm8OiL2VK9nZgA8ABwdEWu7TpD0yOB3J08+B2BmlilfBWRmlikHgJlZphwAZmaZcgCYmWXKAWBmlqn/B5d6oEJabyetAAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "\n",
    "\n",
    "label_count_df = train_df.groupBy('Sentiment').count().toPandas()\n",
    "\n",
    "label_count_df.plot(kind='bar')\n",
    "\n",
    "plt.title('Label Count')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+------------------------------------------------------------------------+---------+\n",
      "|Review                                                                  |Sentiment|\n",
      "+------------------------------------------------------------------------+---------+\n",
      "|The Da Vinci Code is awesome!!                                          |1        |\n",
      "|So Brokeback Mountain was really depressing.                            |0        |\n",
      "|friday hung out with kelsie and we went and saw The Da Vinci Code SUCKED|0        |\n",
      "|Always knows what I want, not guy crazy, hates Harry Potter..           |0        |\n",
      "|The Da Vinci Code was awesome, I can't wait to read it...               |1        |\n",
      "|Harry Potter dragged Draco Malfoy ’ s trousers down past his hips and   |0        |\n",
      "|the people who are worth it know how much i love the da vinci code.     |1        |\n",
      "|The Da Vinci Code is awesome..                                          |1        |\n",
      "|DA VINCI CODE IS AWESOME!!                                              |1        |\n",
      "|So Brokeback Mountain was really depressing.                            |0        |\n",
      "+------------------------------------------------------------------------+---------+\n",
      "only showing top 10 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Show 10 rows in random order\n",
    "from pyspark.sql.functions import rand\n",
    "\n",
    "\n",
    "train_df.orderBy(rand()).show(10, False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-----------------------------------------------------------------------+-----+\n",
      "|Review                                                                 |Label|\n",
      "+-----------------------------------------------------------------------+-----+\n",
      "|Oh, and Brokeback Mountain is a TERRIBLE movie...                      |0    |\n",
      "|Harry Potter dragged Draco Malfoy ’ s trousers down past his hips and  |0    |\n",
      "|These Harry Potter movies really suck.                                 |0    |\n",
      "|watched mission impossible 3 wif stupid haha...                        |0    |\n",
      "|i love being a sentry for mission impossible and a station for bonkers.|1    |\n",
      "|I love The Da Vinci Code...                                            |1    |\n",
      "|I either LOVE Brokeback Mountain or think it's great that homosexuality|1    |\n",
      "|Brokeback Mountain was so awesome.                                     |1    |\n",
      "|dudeee i LOVED brokeback mountain!!!!                                  |1    |\n",
      "|\"Anyway, thats why I love \"\" Brokeback Mountain.\"                      |1    |\n",
      "+-----------------------------------------------------------------------+-----+\n",
      "only showing top 10 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Create an integer Label column from Sentiment, then drop Sentiment\n",
    "\n",
    "train_df = train_df.withColumn('Label', train_df.Sentiment.cast('int')).drop('Sentiment')\n",
    "\n",
    "train_df.orderBy(rand()).show(10, False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+------------------------------------------------------------------------+-----+------+\n",
      "|Review                                                                  |Label|length|\n",
      "+------------------------------------------------------------------------+-----+------+\n",
      "|\"\"\" I hate Harry Potter.\"                                               |0    |25    |\n",
      "|I love Harry Potter.                                                    |1    |20    |\n",
      "|Harry Potter is AWESOME I don't care if anyone says differently!..      |1    |66    |\n",
      "|Oh, and Brokeback Mountain is a TERRIBLE movie...                       |0    |49    |\n",
      "|The Da Vinci Code is awesome..                                          |1    |30    |\n",
      "|Harry Potter dragged Draco Malfoy ’ s trousers down past his hips and   |0    |69    |\n",
      "|, she helped me bobbypin my insanely cool hat to my head, and she laughe|0    |72    |\n",
      "|I thought Brokeback Mountain was an awful movie.                        |0    |48    |\n",
      "|I wanted desperately to love'The Da Vinci Code as a film.               |1    |57    |\n",
      "|the story of Harry Potter is a deep and profound one, and I love Harry P|1    |72    |\n",
      "+------------------------------------------------------------------------+-----+------+\n",
      "only showing top 10 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Add a column holding the character length of each review\n",
    "from pyspark.sql.functions import length\n",
    "\n",
    "\n",
    "train_df = train_df.withColumn('length', length(train_df['Review']))\n",
    "\n",
    "train_df.orderBy(rand()).show(10, False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-----+-----------------+\n",
      "|Label|      avg(Length)|\n",
      "+-----+-----------------+\n",
      "|    1|47.61882834484523|\n",
      "|    0|50.95845504706264|\n",
      "+-----+-----------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Average review length for positive and negative samples\n",
    "\n",
    "\n",
    "train_df.groupBy('Label').agg({'Length':'mean'}).show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+------------------------------------------------------------------------+----------------------------------------------------------------------------------------+-------------------------------------------------------------+\n",
      "|Review                                                                  |review_token                                                                            |new_token                                                    |\n",
      "+------------------------------------------------------------------------+----------------------------------------------------------------------------------------+-------------------------------------------------------------+\n",
      "|The Da Vinci Code book is just awesome.                                 |[the, da, vinci, code, book, is, just, awesome.]                                        |[da, vinci, code, book, awesome.]                            |\n",
      "|this was the first clive cussler i've ever read, but even books like Rel|[this, was, the, first, clive, cussler, i've, ever, read,, but, even, books, like, rel] |[first, clive, cussler, ever, read,, even, books, like, rel] |\n",
      "|i liked the Da Vinci Code a lot.                                        |[i, liked, the, da, vinci, code, a, lot.]                                               |[liked, da, vinci, code, lot.]                               |\n",
      "|i liked the Da Vinci Code a lot.                                        |[i, liked, the, da, vinci, code, a, lot.]                                               |[liked, da, vinci, code, lot.]                               |\n",
      "|I liked the Da Vinci Code but it ultimatly didn't seem to hold it's own.|[i, liked, the, da, vinci, code, but, it, ultimatly, didn't, seem, to, hold, it's, own.]|[liked, da, vinci, code, ultimatly, seem, hold, own.]        |\n",
      "|that's not even an exaggeration ) and at midnight we went to Wal-Mart to|[that's, not, even, an, exaggeration, ), and, at, midnight, we, went, to, wal-mart, to] |[even, exaggeration, ), midnight, went, wal-mart]            |\n",
      "|I loved the Da Vinci Code, but now I want something better and different|[i, loved, the, da, vinci, code,, but, now, i, want, something, better, and, different] |[loved, da, vinci, code,, want, something, better, different]|\n",
      "|i thought da vinci code was great, same with kite runner.               |[i, thought, da, vinci, code, was, great,, same, with, kite, runner.]                   |[thought, da, vinci, code, great,, kite, runner.]            |\n",
      "|The Da Vinci Code is actually a good movie...                           |[the, da, vinci, code, is, actually, a, good, movie...]                                 |[da, vinci, code, actually, good, movie...]                  |\n",
      "|I thought the Da Vinci Code was a pretty good book.                     |[i, thought, the, da, vinci, code, was, a, pretty, good, book.]                         |[thought, da, vinci, code, pretty, good, book.]              |\n",
      "+------------------------------------------------------------------------+----------------------------------------------------------------------------------------+-------------------------------------------------------------+\n",
      "only showing top 10 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Tokenization: lowercase each review and split it on whitespace\n",
    "\n",
    "tokenization = Tokenizer(inputCol='Review', outputCol='review_token')\n",
    "\n",
    "train_df = tokenization.transform(train_df)\n",
    "\n",
    "# Stop-word removal: drop common English words such as 'the', 'is', 'i'\n",
    "stop_words_removal = StopWordsRemover(inputCol='review_token', outputCol='new_token')\n",
    "\n",
    "train_df = stop_words_removal.transform(train_df)\n",
    "\n",
    "train_df.select(['Review', 'review_token', 'new_token']).show(10, False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+--------------------+-----------+\n",
      "|           new_token|token_count|\n",
      "+--------------------+-----------+\n",
      "|[da, vinci, code,...|          5|\n",
      "|[first, clive, cu...|          9|\n",
      "|[liked, da, vinci...|          5|\n",
      "|[liked, da, vinci...|          5|\n",
      "|[liked, da, vinci...|          8|\n",
      "|[even, exaggerati...|          6|\n",
      "|[loved, da, vinci...|          8|\n",
      "|[thought, da, vin...|          7|\n",
      "|[da, vinci, code,...|          6|\n",
      "|[thought, da, vin...|          7|\n",
      "+--------------------+-----------+\n",
      "only showing top 10 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Compute the length of each review after stop-word removal\n",
    "from pyspark.sql.functions import udf, rand, col\n",
    "from pyspark.sql.types import IntegerType\n",
    "\n",
    "\n",
    "token_count = udf(lambda x: len(x), IntegerType())\n",
    "\n",
    "train_df = train_df.withColumn('token_count', token_count(col('new_token')))\n",
    "\n",
    "train_df.select(['new_token', 'token_count']).show(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Convert text into numbers\n",
    "\n",
    "- Count vector\n",
    "- TF (term frequency)\n",
    "- TF-IDF (term frequency-inverse document frequency)"
   ]
  },
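  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a brief sketch of what the three encodings compute (the symbols $t$, $d$ and $D$ are notation introduced here, not names used in the notebook): for a term $t$, a review $d$ and the corpus $D$, the count vector and the hashed TF vector both store raw term counts $\\mathrm{TF}(t, d)$, although the hashed version may merge terms that collide in the same bucket. TF-IDF then down-weights terms that occur in many reviews. Spark's `IDF` estimator uses a smoothed inverse document frequency:\n",
    "\n",
    "$$\\mathrm{IDF}(t, D) = \\log\\frac{|D| + 1}{\\mathrm{DF}(t, D) + 1}, \\qquad \\mathrm{TFIDF}(t, d, D) = \\mathrm{TF}(t, d) \\cdot \\mathrm{IDF}(t, D)$$\n",
    "\n",
    "Here $\\mathrm{DF}(t, D)$ is the number of reviews that contain $t$. A term that appears in every review gets $\\mathrm{IDF} = \\log 1 = 0$, so tokens such as `da` or `code`, which occur in most of the rows displayed earlier, should contribute comparatively little to the TF-IDF vector."
   ]
  },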
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+--------------------+-----------+--------------------+-----+\n",
      "|           new_token|token_count|        count_vector|label|\n",
      "+--------------------+-----------+--------------------+-----+\n",
      "|[da, vinci, code,...|          5|(2302,[0,1,4,43,2...|    1|\n",
      "|[first, clive, cu...|          9|(2302,[11,51,229,...|    1|\n",
      "|[liked, da, vinci...|          5|(2302,[0,1,4,53,3...|    1|\n",
      "|[liked, da, vinci...|          5|(2302,[0,1,4,53,3...|    1|\n",
      "|[liked, da, vinci...|          8|(2302,[0,1,4,53,6...|    1|\n",
      "|[even, exaggerati...|          6|(2302,[46,229,271...|    1|\n",
      "|[loved, da, vinci...|          8|(2302,[0,1,22,30,...|    1|\n",
      "|[thought, da, vin...|          7|(2302,[0,1,4,228,...|    1|\n",
      "|[da, vinci, code,...|          6|(2302,[0,1,4,33,2...|    1|\n",
      "|[thought, da, vin...|          7|(2302,[0,1,4,223,...|    1|\n",
      "+--------------------+-----------+--------------------+-----+\n",
      "only showing top 10 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Count vector: learn a vocabulary from the corpus, then encode each review as term counts\n",
    "count_vector = CountVectorizer(inputCol='new_token', outputCol='count_vector')\n",
    "\n",
    "train_count_vec = count_vector.fit(train_df).transform(train_df)\n",
    "\n",
    "train_count_vec.select(['new_token', 'token_count', 'count_vector', 'label']).show(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+--------------------+--------------------+-----+\n",
      "|           new_token|           tf_vector|label|\n",
      "+--------------------+--------------------+-----+\n",
      "|[da, vinci, code,...|(262144,[93284,11...|    1|\n",
      "|[first, clive, cu...|(262144,[47372,82...|    1|\n",
      "|[liked, da, vinci...|(262144,[32675,93...|    1|\n",
      "|[liked, da, vinci...|(262144,[32675,93...|    1|\n",
      "|[liked, da, vinci...|(262144,[5765,326...|    1|\n",
      "|[even, exaggerati...|(262144,[105591,1...|    1|\n",
      "|[loved, da, vinci...|(262144,[33933,11...|    1|\n",
      "|[thought, da, vin...|(262144,[2000,335...|    1|\n",
      "|[da, vinci, code,...|(262144,[93284,11...|    1|\n",
      "|[thought, da, vin...|(262144,[23661,93...|    1|\n",
      "+--------------------+--------------------+-----+\n",
      "only showing top 10 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# TF vector: hash each token into a fixed-size feature space (262144 buckets by default)\n",
    "tf_vector = HashingTF(inputCol='new_token', outputCol='tf_vector')\n",
    "\n",
    "train_tf_vec = tf_vector.transform(train_df)\n",
    "\n",
    "train_tf_vec.select(['new_token', 'tf_vector', 'label']).show(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+--------------------+--------------------+-----+\n",
      "|           new_token|        tfidf_vector|label|\n",
      "+--------------------+--------------------+-----+\n",
      "|[da, vinci, code,...|(262144,[93284,11...|    1|\n",
      "|[first, clive, cu...|(262144,[47372,82...|    1|\n",
      "|[liked, da, vinci...|(262144,[32675,93...|    1|\n",
      "|[liked, da, vinci...|(262144,[32675,93...|    1|\n",
      "|[liked, da, vinci...|(262144,[5765,326...|    1|\n",
      "|[even, exaggerati...|(262144,[105591,1...|    1|\n",
      "|[loved, da, vinci...|(262144,[33933,11...|    1|\n",
      "|[thought, da, vin...|(262144,[2000,335...|    1|\n",
      "|[da, vinci, code,...|(262144,[93284,11...|    1|\n",
      "|[thought, da, vin...|(262144,[23661,93...|    1|\n",
      "+--------------------+--------------------+-----+\n",
      "only showing top 10 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# TF-IDF: rescale the hashed term frequencies by inverse document frequency\n",
    "tfidf_vector = IDF(inputCol='tf_vector', outputCol='tfidf_vector')\n",
    "\n",
    "train_tfidf_vec = tfidf_vector.fit(train_tf_vec).transform(train_tf_vec)\n",
    "\n",
    "train_tfidf_vec.select(['new_token', 'tfidf_vector', 'label']).show(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+--------------------+-----+\n",
      "|                   X|Label|\n",
      "+--------------------+-----+\n",
      "|(2303,[0,1,4,43,2...|    1|\n",
      "|(2303,[11,51,229,...|    1|\n",
      "|(2303,[0,1,4,53,3...|    1|\n",
      "|(2303,[0,1,4,53,3...|    1|\n",
      "|(2303,[0,1,4,53,6...|    1|\n",
      "|(2303,[46,229,271...|    1|\n",
      "|(2303,[0,1,22,30,...|    1|\n",
      "|(2303,[0,1,4,228,...|    1|\n",
      "|(2303,[0,1,4,33,2...|    1|\n",
      "|(2303,[0,1,4,223,...|    1|\n",
      "+--------------------+-----+\n",
      "only showing top 10 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Combine the feature columns into a single vector column\n",
    "\n",
    "from pyspark.ml.feature import VectorAssembler\n",
    "\n",
    "assembler = VectorAssembler(inputCols=['count_vector','token_count'], outputCol='X')\n",
    "\n",
    "train_count_vec = assembler.transform(train_count_vec)\n",
    "\n",
    "train_count_vec.select(['X', 'Label']).show(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+--------------------+-----+\n",
      "|                   X|Label|\n",
      "+--------------------+-----+\n",
      "|(262145,[93284,11...|    1|\n",
      "|(262145,[47372,82...|    1|\n",
      "|(262145,[32675,93...|    1|\n",
      "|(262145,[32675,93...|    1|\n",
      "|(262145,[5765,326...|    1|\n",
      "|(262145,[105591,1...|    1|\n",
      "|(262145,[33933,11...|    1|\n",
      "|(262145,[2000,335...|    1|\n",
      "|(262145,[93284,11...|    1|\n",
      "|(262145,[23661,93...|    1|\n",
      "+--------------------+-----+\n",
      "only showing top 10 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "assembler = VectorAssembler(inputCols=['tf_vector','token_count'], outputCol='X')\n",
    "\n",
    "train_tf_vec = assembler.transform(train_tf_vec)\n",
    "\n",
    "train_tf_vec.select(['X', 'Label']).show(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+--------------------+-----+\n",
      "|                   X|Label|\n",
      "+--------------------+-----+\n",
      "|(262145,[93284,11...|    1|\n",
      "|(262145,[47372,82...|    1|\n",
      "|(262145,[32675,93...|    1|\n",
      "|(262145,[32675,93...|    1|\n",
      "|(262145,[5765,326...|    1|\n",
      "|(262145,[105591,1...|    1|\n",
      "|(262145,[33933,11...|    1|\n",
      "|(262145,[2000,335...|    1|\n",
      "|(262145,[93284,11...|    1|\n",
      "|(262145,[23661,93...|    1|\n",
      "+--------------------+-----+\n",
      "only showing top 10 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "assembler = VectorAssembler(inputCols=['tfidf_vector', 'token_count'], outputCol='X')\n",
    "\n",
    "train_tfidf_vec = assembler.transform(train_tfidf_vec)\n",
    "\n",
    "train_tfidf_vec.select(['X', 'Label']).show(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## LogisticRegression\n",
    "\n",
    "Test the model on the three feature sets:\n",
    "\n",
    "- countvector : train_count_vec\n",
    "- tf : train_tf_vec\n",
    "- tf-idf : train_tfidf_vec"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyspark.ml.classification import LogisticRegression\n",
    "\n",
    "# count vector \n",
    "train_1, test_1 = train_count_vec.randomSplit([0.75, 0.25])\n",
    "# tf vector\n",
    "train_2, test_2 = train_tf_vec.randomSplit([0.75, 0.25])\n",
    "# tf-idf vector\n",
    "train_3, test_3 = train_tfidf_vec.randomSplit([0.75, 0.25])\n",
    "\n",
    "# Training model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fit one logistic regression model per feature set\n",
    "model_1 = LogisticRegression(featuresCol='X', labelCol='Label').fit(train_1)\n",
    "model_2 = LogisticRegression(featuresCol='X', labelCol='Label').fit(train_2)\n",
    "model_3 = LogisticRegression(featuresCol='X', labelCol='Label').fit(train_3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [],
   "source": [
    "result_1 = model_1.evaluate(test_1).predictions\n",
    "result_2 = model_2.evaluate(test_2).predictions\n",
    "result_3 = model_3.evaluate(test_3).predictions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2 id='10'>Evaluation</h2>\n",
    "\n",
    "- train_1 : count vector\n",
    "- train_2 : tf vector\n",
    "- train_3 : tfidf vector"
   ]
  },
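  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For reference, a sketch of the metrics used below, in notation introduced here ($N$ test rows, $N_c$ rows whose true label is $c$, and $TP_c$, $FP_c$ the true and false positives for class $c$):\n",
    "\n",
    "$$\\mathrm{accuracy} = \\frac{1}{N}\\sum_{i=1}^{N} \\mathbf{1}[\\hat{y}_i = y_i], \\qquad \\mathrm{weightedPrecision} = \\sum_{c} \\frac{N_c}{N} \\cdot \\frac{TP_c}{TP_c + FP_c}$$\n",
    "\n",
    "`BinaryClassificationEvaluator` reports the area under the ROC curve by default. Note that each feature set gets its own unseeded `randomSplit`, so the three test sets are not identical and small differences between the scores should be read with some caution."
   ]
  },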
  {
   "cell_type": "code",
   "execution_count": 75,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator\n",
    "\n",
    "accuracy_1 = MulticlassClassificationEvaluator(labelCol='Label', metricName='accuracy').evaluate(result_1)\n",
    "accuracy_2 = MulticlassClassificationEvaluator(labelCol='Label', metricName='accuracy').evaluate(result_2)\n",
    "accuracy_3 = MulticlassClassificationEvaluator(labelCol='Label', metricName='accuracy').evaluate(result_3)\n",
    "\n",
    "precision_1 = MulticlassClassificationEvaluator(labelCol='Label', metricName='weightedPrecision').evaluate(result_1)\n",
    "precision_2 = MulticlassClassificationEvaluator(labelCol='Label', metricName='weightedPrecision').evaluate(result_2)\n",
    "precision_3 = MulticlassClassificationEvaluator(labelCol='Label', metricName='weightedPrecision').evaluate(result_3)\n",
    "\n",
    "auc_1 = BinaryClassificationEvaluator(labelCol='Label').evaluate(result_1)\n",
    "auc_2 = BinaryClassificationEvaluator(labelCol='Label').evaluate(result_2)\n",
    "auc_3 = BinaryClassificationEvaluator(labelCol='Label').evaluate(result_3)\n",
    "\n",
    "scores_df = pd.DataFrame({'feature_type': ['Count_vec', 'TF_vec', 'TF-IDF_vec'],\n",
    "                          'accuracy': [accuracy_1, accuracy_2, accuracy_3],\n",
    "                          'precision': [precision_1, precision_2, precision_3],\n",
    "                          'auc': [auc_1, auc_2, auc_3]})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 84,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<matplotlib.axes._subplots.AxesSubplot at 0x7fcb089dbe10>"
      ]
     },
     "execution_count": 84,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAABIsAAAGwCAYAAAApCuWnAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvOIA7rQAAIABJREFUeJzt3X2UbHV9Jvrn6wHEF4wvMF4jCL6OMkoQEaPGC2p0TJz4gsaIGnUyCbNWNGYy1zuj1yyThWM0M2rGjGZlGIOJidEoUS9GEjQoiY7xykFARYPBV158IVFUNErA7/2j9tkpmtPVfaC7q3fz+axVi6q9d1X9+vwO+zn91N67qrsDAAAAAElyi2UPAAAAAIDtQ1kEAAAAwEhZBAAAAMBIWQQAAADASFkEAAAAwEhZBAAAAMBIWQQAcCNV1UVVdcIa29ytqq6uql1bNCwAgJtEWcSkVdU5VfWNqrrlsscCwPZSVV+oqn8cipqvVtXvV9VtN/I9uvtfdfc5a2zzpe6+bXdft5HvDcBy7e13kWHZz6/Y7oSqumzucVXVC6rqk1X1naq6rKreXlUP2MrxwyLKIiarqo5I8ogkneQJW/i++23VewFwk/1Ud982yTFJjk3yq/Mrh3+w+/cQAPvkJv4u8tokv5zkBUnumOQ+Sd6V5PEbN0K4afzjiCl7dpKPJPn9JM/Zs7CqblVVr66qL1bVN6vqQ1V1q2Hdj1XVh6vqqqq6tKqeOyy/3icAVfXcqvrQ3OOuqudV1d8l+bth2WuH1/hWVZ1XVY+Y235XVf0/VfXZqvr2sP6wqnp9Vb16/oeoqjOq6lc24w8IgJnuvjzJnye5/7DPf3lV/e8k301yj6r6oar6var6clVdXlX/Zf60sar6har69LBP/1RVHTMs/0JV/fhw/7iq2j3kwler6jXD8iOGHNlvePzDw77/61V1SVX9wtz7/HpVva2q3jS810VVdezW/UkBsE57/V1kLVV17yTPS3JSd7+/u7/f3d/t7jd39ys3Z6iw75RFTNmzk7x5uP3rqrrzsPxVSR6U5GGZNfX/KckPqurwzH5R+B9JDklydJIL9uH9npTkIUmOHB6fO7zGHZP8cZK3V9WBw7r/mOSkJD+Z5HZJfi6zX0j+IMlJez7FrqqDk/z48HwANklVHZbZPvn8YdHPJjk5yUFJvpjZP/avTXKvJA9M8tgkPz8896eT/HpmuXO7zD5B/oe9vM1rk7y2u2+X5J5J3rbKcN6a5LIkP5zkqUl+o6oeNbf+CcM2t09yRpLX7eOPC8DmW+13kbU8Osll3f3RTRsZbABlEZNUVT+W5PAkb+vu85J8NskzhhLm55L8cndf3t3XdfeHu/v7SZ6R5C+7+y3d/U/d/Q/dvS9l0Su6++vd/Y9J0t1/NLzGtd396iS3TPIvh21/PsmvdvfFPXPhsO1Hk3wzs5BIkqcnOae7v3oT/0gA2Lt3VdVVST6U5K+S/Maw/Pe7+6Luvjaz0v8nk/yH7v5Od38tyW9lto9OZvv0/9rd5w779Eu6+4t7ea9/SnKvqjq4u6/u7o+s3GAorR6e5D939/eGHHpDZr907PGh7j5zuMbRHyb5kZv6hwDAxlntd5F1Pv1OSb68WWODjaIsYqqek+S93f33w+M/HpYdnOTAzHbYKx22yvL1unT+QVW9cDgl4ZvDLyI/NLz/Wu/1B0meNdx/Vma/CACwOZ7U3bfv7sO7+xf3FP65/j798CT7J/nycJryVUn+Z5J/Maxfb378u8yuO/G3VXVuVf2bvWzzw0m+3t3fnlv2xSR3nXv8lbn7301yoOvlAWwrq/0uksyOUt1/xfb7Z/aBQjI7MvUumz5CuIn8w4PJGa4/9LQku6pqzz+ob5nZ4fp3SfK9zA7/v3DFUy9NctwqL/udJLeee/x/7GWbnhvDIzI7ve3RSS7q7h9U1TeS1Nx73TPJJ/fyOn+U5JNV
9SNJ7pfZxewA2Fo9d//SJN9PcvBwpNFKe/bpi1+w++/yz6can5jk9Kq604rNrkhyx6o6aK4wuluSy/f1BwBg6y36XWT49/2Xkhyx4ml3z+yDgSQ5O8nrq+rY7t69BUOGG8WRRUzRk5Jcl9m1g44ebvdL8sHMDuM/LclrhguI7qqqh9bs6yzfnOTHq+ppVbVfVd2pqo4eXvOCJCdW1a2r6l6ZfTq8yEGZfWpwZZL9quqlmV3HYo83JHlZVd17+Kado/b8wtDdl2V2vaM/TPKnc59yA7AE3f3lJO9N8uqqul1V3aKq7llVxw+bvCHJC6vqQcM+/V7DdfCup6qeVVWHdPcPklw1LP7Bive6NMmHk7yiqg6sqqMyy5w/2qyfD4ANtdbvIn+S5N8OX3pQVXWfJL+S2bXo9nyw8DtJ3lJVJ1TVAUMePL2qXrSEnwf2SlnEFD0nyRu7+0vd/ZU9t8wuAPrMJC9K8onMCpmvJ/nNJLfo7i9ldk2K/2tYfkH++ToQv5XkmiRfzew0sTevMYazkvxFks9k9inB93L9Uxpek9mFTd+b5FtJfi/JrebW/0GSB8QpaADbxbOTHJDkU0m+keT0DKcJdPfbk7w8s9MMvp3ZEaF33MtrPC7JRVV1dWYXu376Kh8InJTZp85XJHlnkl/r7r/cyB8GgE2z1u8iZ2f2+8gbM7tW6ZmZ/dv/1LnXeMGw/esz+3Dhs0menOTdW/ZTwBqqu9feCthQVfV/ZvYp8uHtf0IAAAC2EUcWwRarqv2T/HKSNyiKAAAA2G7WLIuq6rSq+lpV7e1CvRnOw/ztqrqkqj5eVcfMrXtOVf3dcHvO3p4PNydVdb/MDjW9S5L/vuThwIaQEwAsIicApmfN09CG02WuTvKm7r7/Xtb/ZJJfyuxaMA9J8trufkhV3THJ7iTHZvaNI+cleVB3f2NjfwQAlklOALCInACYnjWPLOruv87sYsCreWJmO/7u7o9k9pWBd0nyr5O8r7u/PuzQ35fZhR8B2EHkBACLyAmA6dlvA17jrrn+t0BdNixbbfkNVNXJSU5Oktvc5jYPuu9977sBwwLYec4777y/7+5Dlj2OfSQnALaInJATAKvZl4zYiLLoJuvuUzN8leCxxx7bu3fvXvKIALanqvrissewDHICYH3khJwAWM2+ZMRGfBva5UkOm3t86LBsteUA3LzICQAWkRMA28xGlEVnJHn28C0GP5rkm9395SRnJXlsVd2hqu6Q5LHDMgBuXuQEAIvICYBtZs3T0KrqLUlOSHJwVV2W5NeS7J8k3f27Sc7M7JsLLkny3ST/dlj39ap6WZJzh5c6pbsXXdgOgAmSEwAsIicApmfNsqi7T1pjfSd53irrTkty2o0bGgBTICcAWEROAEzPRpyGBgAAAMAOoSwCAAAAYKQsAgAAAGCkLAIAAABgtOYFrgEAANi7I170nmUPYVN84ZWPX/YQgCVyZBEAAAAAI0cWAQAAANvWTj2CL9m+R/E5sggAAACAkbIIAAAAgJGyCAAAAICRsggAAACAkbIIAAAAgJFvQwMAAOBmwzdrwdocWQQAAADASFkEAAAAwEhZBAAAAMBIWQQAAADASFkEAAAAwEhZBAAAAMBIWQQAAADASFkEAAAAwEhZBAAAAMBIWQQAAADASFkEAAAAwEhZBAAAAMBIWQQAAADASFkEAAAAwEhZBAAAAMBIWQQAAADASFkEAAAAwEhZBAAAAMBIWQQAAADASFkEAAAAwEhZBAAAAMBIWQQAAADASFkEAAAAwEhZBAAAAMBIWQQAAADASFkEAAAAwEhZBAAAAMBIWQQAAADASFkEAAAAwEhZBAAAAMBIWQQAAADASFkEAAAAwEhZBAAAAMBIWQQAAADASFkEAAAAwEhZBAAAAMBIWQQAAADASFkEAAAAwEhZBAAAAMBIWQQAAADASFkEAAAAwEhZBAAAAMBIWQQAAADAaF1lUVU9rqourqpLqupFe1l/eFWdXVUfr6pz
qurQuXW/WVWfHG4/s5GDB2B7kBMALCInAKZlzbKoqnYleX2Sn0hyZJKTqurIFZu9KsmbuvuoJKckecXw3McnOSbJ0UkekuSFVXW7jRs+AMsmJwBYRE4ATM96jiw6Lskl3f257r4myVuTPHHFNkcmef9w/wNz649M8tfdfW13fyfJx5M87qYPG4BtRE4AsIicAJiY9ZRFd01y6dzjy4Zl8y5McuJw/8lJDqqqOw3LH1dVt66qg5M8MslhN23IAGwzcgKAReQEwMRs1AWuX5jk+Ko6P8nxSS5Pcl13vzfJmUk+nOQtSf4myXUrn1xVJ1fV7qrafeWVV27QkADYRuQEAIvICYBtZD1l0eW5fnt/6LBs1N1XdPeJ3f3AJC8Zll01/Pfl3X10dz8mSSX5zMo36O5Tu/vY7j72kEMOuZE/CgBLIicAWEROAEzMesqic5Pcu6ruXlUHJHl6kjPmN6iqg6tqz2u9OMlpw/Jdw+GjqaqjkhyV5L0bNXgAtgU5AcAicgJgYvZba4Puvraqnp/krCS7kpzW3RdV1SlJdnf3GUlOSPKKquokf53kecPT90/ywapKkm8leVZ3X7vxPwYAyyInAFhETgBMz5plUZJ095mZnSs8v+ylc/dPT3L6Xp73vcy+wQCAHUxOALCInACYlo26wDUAAAAAO4CyCAAAAICRsggAAACAkbIIAAAAgJGyCAAAAICRsggAAACAkbIIAAAAgJGyCAAAAICRsggAAACAkbIIAAAAgJGyCAAAAICRsggAAACAkbIIAAAAgJGyCAAAAICRsggAAACAkbIIAAAAgJGyCAAAAICRsggAAACAkbIIAAAAgJGyCAAAAICRsggAAACAkbIIAAAAgJGyCAAAAICRsggAAACAkbIIAAAAgJGyCAAAAICRsggAAACAkbIIAAAAgJGyCAAAAICRsggAAACAkbIIAAAAgJGyCAAAAICRsggAAACAkbIIAAAAgJGyCAAAAICRsggAAACAkbIIAAAAgJGyCAAAAICRsggAAACAkbIIAAAAgJGyCAAAAICRsggAAACAkbIIAAAAgJGyCAAAAICRsggAAACAkbIIAAAAgJGyCAAAAICRsggAAACAkbIIAAAAgJGyCAAAAICRsggAAACAkbIIAAAAgJGyCAAAAICRsggAAACAkbIIAAAAgJGyCAAAAICRsggAAACA0brKoqp6XFVdXFWXVNWL9rL+8Ko6u6o+XlXnVNWhc+v+a1VdVFWfrqrfrqrayB8AgOWTEwAsIicApmXNsqiqdiV5fZKfSHJkkpOq6sgVm70qyZu6+6gkpyR5xfDchyV5eJKjktw/yYOTHL9howdg6eQEAIvICYDpWc+RRccluaS7P9fd1yR5a5InrtjmyCTvH+5/YG59JzkwyQFJbplk/yRfvamDBmBbkRMALCInACZmPWXRXZNcOvf4smHZvAuTnDjcf3KSg6rqTt39N5nt7L883M7q7k+vfIOqOrmqdlfV7iuvvHJffwYAlktOALCInACYmI26wPULkxxfVedndljo5Umuq6p7JblfkkMzC4RHVdUjVj65u0/t7mO7+9hDDjlkg4YEwDYiJwBYRE4AbCP7rWOby5McNvf40GHZqLuvyPBJQFXdNslTuvuqqvqFJB/p7quHdX+e5KFJPrgBYwdge5ATACwiJwAmZj1HFp2b5N5VdfeqOiDJ05OcMb9BVR1cVXte68VJThvufymzTwj2q6r9M/uU4AaHjQIwaXICgEXkBMDErFkWdfe1SZ6f5KzMdsxv6+6LquqUqnrCsNkJSS6uqs8kuXOSlw/LT0/y2SSfyOw85Au7+90b+yMAsExyAoBF5ATA9KznNLR095lJzlyx7KVz90/PbEe+8nnXJfn3N3GMAGxzcgKAReQEwLRs1AWuAQAAANgBlEUAAAAAjJRFAAAAAIyURQAAAACMlEUAAAAAjJRFAAAAAIyURQAAAACMlEUAAAAAjJRFAAAAAIyURQAAAACM
lEUAAAAAjJRFAAAAAIyURQAAAACMlEUAAAAAjJRFAAAAAIyURQAAAACMlEUAAAAAjJRFAAAAAIyURQAAAACMlEUAAAAAjJRFAAAAAIz2W/YAAGA7O+JF71n2EDbFF175+GUPAQCAbepmXxbt1F8CEr8IAAAAAPvOaWgAAAAAjJRFAAAAAIyURQAAAACMlEUAAAAAjJRFAAAAAIyURQAAAACMlEUAAAAAjJRFAAAAAIyURQAAAACMlEUAAAAAjJRFAAAAAIyURQAAAACM9lv2AAAA4IgXvWfZQ9gUX3jl45c9BADYZ44sAgAAAGCkLAIAAABg5DQ0YEs4vQAAAGAaHFkEAAAAwEhZBAAAAMDIaWgAwI6yU097TZz6CgBsDWURk+OXAAAAANg8TkMDAAAAYKQsAgAAAGCkLAIAAABgpCwCAAAAYKQsAgAAAGCkLAIAAABgpCwCAAAAYKQsAgAAAGCkLAIAAABgpCwCAAAAYKQsAgAAAGCkLAIAAABgpCwCAAAAYLSusqiqHldVF1fVJVX1or2sP7yqzq6qj1fVOVV16LD8kVV1wdzte1X1pI3+IQBYLjkBwCJyAmBa1iyLqmpXktcn+YkkRyY5qaqOXLHZq5K8qbuPSnJKklckSXd/oLuP7u6jkzwqyXeTvHcDxw/AkskJABaREwDTs54ji45Lckl3f667r0ny1iRPXLHNkUneP9z/wF7WJ8lTk/x5d3/3xg4WgG1JTgCwiJwAmJj1lEV3TXLp3OPLhmXzLkxy4nD/yUkOqqo7rdjm6Unesrc3qKqTq2p3Ve2+8sor1zEkALYROQHAInICYGI26gLXL0xyfFWdn+T4JJcnuW7Pyqq6S5IHJDlrb0/u7lO7+9juPvaQQw7ZoCEBsI3ICQAWkRMA28h+69jm8iSHzT0+dFg26u4rMnwSUFW3TfKU7r5qbpOnJXlnd//TTRsuANuQnABgETkBMDHrObLo3CT3rqq7V9UBmR3+ecb8BlV1cFXtea0XJzltxWuclFUOGQVg8uQEAIvICYCJWbMs6u5rkzw/s0M+P53kbd19UVWdUlVPGDY7IcnFVfWZJHdO8vI9z6+qIzL7JOGvNnTkAGwLcgKAReQEwPSs5zS0dPeZSc5cseylc/dPT3L6Ks/9Qm54ATsAdhA5AcAicgJgWjbqAtcAAAAA7ADKIgAAAABGyiIAAAAARsoiAAAAAEbKIgAAAABGyiIAAAAARsoiAAAAAEbKIgAAAABGyiIAAAAARsoiAAAAAEbKIgAAAABGyiIAAAAARsoiAAAAAEbKIgAAAABGyiIAAAAARsoiAAAAAEbKIgAAAABGyiIAAAAARsoiAAAAAEbKIgAAAABGyiIAAAAARsoiAAAAAEbKIgAAAABGyiIAAAAARsoiAAAAAEbKIgAAAABGyiIAAAAARsoiAAAAAEbKIgAAAABGyiIAAAAARsoiAAAAAEbKIgAAAABGyiIAAAAARsoiAAAAAEbKIgAAAABGyiIAAAAARsoiAAAAAEbKIgAAAABGyiIAAAAARsoiAAAAAEbKIgAAAABGyiIAAAAARsoiAAAAAEbKIgAAAABGyiIAAAAARsoiAAAAAEbKIgAAAABGyiIAAAAARsoiAAAAAEbKIgAAAABGyiIAAAAARsoiAAAAAEbKIgAAAABGyiIAAAAARsoiAAAAAEbKIgAAAABG6yqLqupxVXVxVV1SVS/ay/rDq+rsqvp4VZ1TVYfOrbtbVb23qj5dVZ+qqiM2bvgAbAdyAoBF5ATAtKxZFlXVriSvT/ITSY5MclJVHblis1cleVN3H5XklCSvmFv3piT/rbvvl+S4JF/biIEDsD3ICQAWkRMA07OeI4uOS3JJd3+uu69J8tYkT1yxzZFJ3j/c/8Ce9UMI7Nfd70uS7r66u7+7ISMHYLuQEwAsIicAJmY9ZdFdk1w69/iyYdm8C5OcONx/cpKDqupOSe6T5KqqekdVnV9V/234ZOF6qurkqtpdVbuv
vPLKff8pAFgmOQHAInICYGI26gLXL0xyfFWdn+T4JJcnuS7JfkkeMax/cJJ7JHnuyid396ndfWx3H3vIIYds0JAA2EbkBACLyAmAbWQ9ZdHlSQ6be3zosGzU3Vd094nd/cAkLxmWXZXZpwYXDIecXpvkXUmO2ZCRA7BdyAkAFpETABOznrLo3CT3rqq7V9UBSZ6e5Iz5Darq4Kra81ovTnLa3HNvX1V76v1HJfnUTR82ANuInABgETkBMDFrlkVDg//8JGcl+XSSt3X3RVV1SlU9YdjshCQXV9Vnktw5ycuH516X2SGjZ1fVJ5JUkv+14T8FAEsjJwBYRE4ATM9+69mou89McuaKZS+du396ktNXee77khx1E8YIwDYnJwBYRE4ATMtGXeAaAAAAgB1AWQQAAADASFkEAAAAwEhZBAAAAMBIWQQAAADASFkEAAAAwEhZBAAAAMBIWQQAAADASFkEAAAAwEhZBAAAAMBIWQQAAADASFkEAAAAwEhZBAAAAMBIWQQAAADASFkEAAAAwEhZBAAAAMBIWQQAAADASFkEAAAAwEhZBAAAAMBIWQQAAADASFkEAAAAwEhZBAAAAMBIWQQAAADASFkEAAAAwEhZBAAAAMBIWQQAAADASFkEAAAAwEhZBAAAAMBIWQQAAADASFkEAAAAwEhZBAAAAMBIWQQAAADASFkEAAAAwEhZBAAAAMBIWQQAAADASFkEAAAAwEhZBAAAAMBIWQQAAADASFkEAAAAwEhZBAAAAMBIWQQAAADASFkEAAAAwEhZBAAAAMBIWQQAAADASFkEAAAAwEhZBAAAAMBIWQQAAADASFkEAAAAwEhZBAAAAMBIWQQAAADASFkEAAAAwEhZBAAAAMBIWQQAAADASFkEAAAAwEhZBAAAAMBIWQQAAADAaF1lUVU9rqourqpLqupFe1l/eFWdXVUfr6pzqurQuXXXVdUFw+2MjRw8ANuDnABgETkBMC37rbVBVe1K8vokj0lyWZJzq+qM7v7U3GavSvKm7v6DqnpUklck+dlh3T9299EbPG4Atgk5AcAicgJgetZzZNFxSS7p7s919zVJ3prkiSu2OTLJ+4f7H9jLegB2LjkBwCJyAmBi1lMW3TXJpXOPLxuWzbswyYnD/ScnOaiq7jQ8PrCqdlfVR6rqSTdptABsR3ICgEXkBMDErHka2jq9MMnrquq5Sf46yeVJrhvWHd7dl1fVPZK8v6o+0d2fnX9yVZ2c5OTh4dVVdfEGjWu7OTjJ32/Vm9VvbtU77XhbNm/mbMPs5Dk7fMvfcWPIifXZyX93dyrZPk07+f81OSEnNoT9zYaRE9O0U/9fW3dGrKcsujzJYXOPDx2Wjbr7igyfBFTVbZM8pbuvGtZdPvz3c1V1TpIHJvnsiuefmuTU9Q56qqpqd3cfu+xxsG/M2/SYsy0nJzaIv7vTY86mybxtOTmxQfzdnR5zNk3mbX2noZ2b5N5VdfeqOiDJ05Nc71sIqurgqtrzWi9Octqw/A5Vdcs92yR5eJL5C9kBMH1yAoBF5ATAxKxZFnX3tUmen+SsJJ9O8rbuvqiqTqmqJwybnZDk4qr6TJI7J3n5sPx+SXZX1YWZXajulSu+9QCAiZMTACwiJwCmp7p72WO42aiqk4dDZJkQ8zY95oyp8nd3eszZNJk3psrf3ekxZ9Nk3pRFAAAAAMxZzzWLAAAAALiZUBYBAAAAMFIWAQAAADBSFm2yqvrRqjpo7vHtquohyxwTi1XV86rq9nOP71BVv7jMMbGYOWPK5MT02OdMjzljyuTE9NjnTI85uyEXuN5kVXV+kmN6+IOuqlsk2d3dxyx3ZKymqi7o7qNXLDu/ux+4rDGxmDljyuTE9NjnTI85Y8rkxPTY50yPObshRxZtvuq5Rq67f5BkvyWOh7Xtqqra86CqdiU5YInjYW3mjCmTE9NjnzM95owpkxPTY58zPeZsBWXR5vtcVb2gqvYfbr+c5HPLHhQL/UWSP6mq
R1fVo5O8ZVjG9mXOmDI5MT32OdNjzpgyOTE99jnTY85WcBraJquqf5Hkt5M8KkknOTvJf+jury11YKxqOLT33yd59LDofUne0N3XLW9ULGLOmDI5MT32OdNjzpgyOTE99jnTY85uSFkEe1FVt0pyt+6+eNljYX3MGbCV7HOmx5wBW8k+Z3rM2fU5DW2TVdV9qursqvrk8PioqvrVZY+L1VXVE5JckOGww6o6uqrOWO6oWMScMWVyYnrsc6bHnDFlcmJ67HOmx5zdkLJo8/2vJC9O8k9J0t0fT/L0pY6ItfxakuOSXJUk3X1BkrsvdUSsxZwxZXJieuxzpsecMWVyYnrsc6bHnK2gLNp8t+7uj65Ydu1SRsJ6/VN3f3PFMudrbm/mjCmTE9NjnzM95owpkxPTY58zPeZsBV+5uPn+vqrumeEvWlU9NcmXlzsk1nBRVT0js69PvHeSFyT58JLHxGLmjCmTE9NjnzM95owpkxPTY58zPeZsBRe43mRVdY8kpyZ5WJJvJPl8kmd29xeXOjBWVVW3TvKSJI8dFp2V5L909/eWNyoWMWdMmZyYHvuc6TFnTJmcmB77nOkxZzekLNpkVbWru6+rqtskuUV3f3vZY2Kxqjqmuz+27HGwfuaMKZMT02OfMz3mjCmTE9NjnzM95uyGXLNo832+qk5N8qNJrl72YFiXV1fVp6vqZVV1/2UPhnUxZ0yZnJge+5zpMWdMmZyYHvuc6TFnKyiLNt99k/xlkudltqN/XVX92JLHxALd/cgkj0xyZZL/WVWf8PWk25s5Y+LkxMTY50yPOWPi5MTE2OdMjzm7IaehbaGqukOS12Z2jvGuZY+HtVXVA5L8pyQ/090HLHs8rM2cMWVyYnrsc6bHnDFlcmJ67HOmx5zNOLJoC1TV8VX1O0nOS3JgkqcteUgsUFX3q6pfr6pPJPkfmV0F/9AlD4sFzBlTJyemxT5neswZUycnpsU+Z3rM2Q05smiTVdUXkpyf5G1Jzuju7yx3RKylqv4myVuTvL27r1j2eFibOWPK5MT02OdMjzljyuTE9NjnTI85uyFl0Sarqtt197cWrH9xd79iK8fETVNVf9rdT1n2OFg/c8Z2Jid2Hvuc6TFnbGdyYuexz5mem+OGbcJwAAAKOUlEQVScOQ1tky3asQ9+eksGwka6x7IHwD4zZ2xbcmJHss+ZHnPGtiUndiT7nOm52c2Zsmj5atkDYJ85HG96zBlTJiemxz5neswZUyYnpsc+Z3pudnOmLFq+m91fOgD2iZwAYBE5AWw4ZdHy+SRgeszZ9Jgzpszf3+kxZ9Njzpgyf3+nx5xNz81uzpRFm6yqHr7Gsrdv4XBYoKruts5N//OmDoR1M2fsBHJiOuxzpsecsRPIiemwz5kec7Y634a2yarqY919zFrLWL75ebk5Xu1+iswZO4GcmA77nOkxZ+wEcmI67HOmx5ytbr9lD2CnqqqHJnlYkkOq6j/Orbpdkl3LGRVrmD+08GZ3tfuJMmdMlpyYJPuc6TFnTJacmCT7nOkxZ6tQFm2eA5LcNrM/44Pmln8ryVOXMiLW0qvcZ/syZ0yZnJge+5zpMWdMmZyYHvuc6TFnq3Aa2iarqsO7+4vLHgdrq6ofJLk6s3b5Vkm+u2dVku7u2y1rbOxdVV2X5DsxZ0yYnJgOOTE9coKdQE5Mh5yYHjmxOkcWbb5bVtWpSY7I3J93dz9qaSNiNRd29wOXPQjWr7sdgs1OICemQ05MjJxgh5AT0yEnJkZOrE5ZtPnenuR3k7whyXVLHguLOcxuoqrqAUnuOzz8VHdftMzxwD6SE9MhJyZKTjBxcmI65MREyYkbchraJquq87r7QcseB2urqsuSvGa19d296jqWo6p+KMn/m+RuSS7M7HDRByT5UpIndve3ljg8WBc5MR1yYnrkBDuBnJgOOTE9cmJ1jizafO+uql9M8s4k39+zsLu/vrwhsYpdmV1EsNbakG3jZUl2J3lUd/8gSarq
FklemeTlSX5piWOD9ZIT0yEnpkdOsBPIiemQE9MjJ1bhyKJNVlWf38vi7m5fy7fNVNXHuvuYZY+D9auqTyU5qruvXbF8vySf6O77LWdksH5yYjrkxPTICXYCOTEdcmJ65MTqHFm0ybr77sseA+vmE4DpuWbljj1Juvvaqvr+3p4A242cmBQ5MT1ygsmTE5MiJ6ZHTqxCWbTJqurZe1ve3W/a6rGwpkcvewDsswOr6oG5YTBXklsuYTywz+TEpMiJ6ZETTJ6cmBQ5MT1yYhXKos334Ln7B2a2A/lYEjv3bcZ535P0lax+EcGvbOVA4CaQExMhJyZJTrATyImJkBOTJCdW4ZpFW6yqbp/krd39uGWPBYDtR04AsIicALaCI4u23neSOO8YNkBVnbhofXe/Y6vGAhtITsAGkRPsUHICNoicWJ2yaJNV1buT7Dl8a1eS+yV52/JGBDvKTy1Y10lutjt3pkNOwKaSE0yenIBNJSdW4TS0TVZVx889vDbJF7v7smWNB4DtRU4AsIicAJbhFssewE7X3X+V5G+THJTkDkmuWe6IYGerqj9b9hhgX8gJ2FpygqmRE7C15MSMsmiTVdXTknw0yU8neVqS/6+qnrrcUcGOdtdlDwD2hZyALScnmBQ5AVtOTsQ1i7bCS5I8uLu/liRVdUiSv0xy+lJHBTvX+cseAOwjOQFbS04wNXICtpaciCOLtsIt9uzYB/8Qf+6wIarqbiuXdffPLWMscBPICdgkcoIdQk7AJpETq7OT2Xx/UVVnVdVzq+q5Sd6T5Mwljwl2inftuVNVf7rMgcBNICdg88gJdgI5AZtHTqzCaWibpKruleTO3f1/V9WJSX5sWPU3Sd68vJHBjlJz9++xtFHAjSAnYEvICSZLTsCWkBOrUBZtnv+e5MVJ0t3vSPKOJKmqBwzrfmp5Q4Mdo1e5D1MgJ2DzyQmmTE7A5pMTq6hufx6boarO7e4Hr7LuE939gK0eE+w0VfWDJFdn9onArZJ8d8+qJN3dt1vW2GAtcgI2n5xgyuQEbD45sTpHFm2e2y9Yd6stGwXsbBd29wOXPQi4keQEbD45wZTJCdh8cmIVLnC9eXZX1S+sXFhVP5/kvCWMB3Yih0YyZXICNp+cYMrkBGw+ObEKp6Ftkqq6c5J3Jrkm/7wzPzbJAUme3N1fWdbYYKeoqsuSvGa19d296jpYNjkBm09OMGVyAjafnFid09A2SXd/NcnDquqRSe4/LH5Pd79/icOCnWZXktvm+t9iAJMgJ2BLyAkmS07AlpATq3BkETBZVfWx7j5m2eMAYHuSEwAsIidW55pFwJT5BACAReQEAIvIiVU4sgiYrKq6Y3d/fdnjAGB7khMALCInVqcsAgAAAGDkNDQAAAAARsoiAAAAAEbKIgAAAABGyiImoapeUFWfrqo37+PzjqiqZ2zWuObe57lV9cOb/T4A7J2cAGAROQH7RlnEVPxiksd09zP38XlHJNnnnXtV7drHpzw3iZ07wPLICQAWkROwD5RFbHtV9btJ7pHkz6vqJVV1WlV9tKrOr6onDtscUVUfrKqPDbeHDU9/ZZJHVNUFVfUrQ2P/urnX/rOqOmG4f3VVvbqqLkzy0Kp6UFX9VVWdV1VnVdVdVhnfU5Mcm+TNw/s8vqreNbf+MVX1zrn3+K2quqiqzq6qQ4bl96yqvxje64NVdd+N/nME2KnkBACLyAm4EbrbzW3b35J8IcnBSX4jybOGZbdP8pkkt0ly6yQHDsvvnWT3cP+EJH829zrPTfK6ucd/luSE4X4nedpwf/8kH05yyPD4Z5KctmB85yQ5drhfSf527rl/nOSn5t7jmcP9l+4ZS5Kzk9x7uP+QJO9f9p+5m5ub25RucsLNzc3NbdFNTri57dttv8C0PDbJE6rqhcPjA5PcLckVSV5XVUcnuS7JfW7Ea1+X5E+H+/8yyf2TvK+qkmRXki+v50W6u6vqD5M8q6remOShSZ49rP5Bkj8Z7v9RkndU
1W2TPCzJ24f3SpJb3ojxAyAnAFhMTsA6KIuYmkrylO6++HoLq349yVeT/Ehmp1d+b5XnX5vrn3554Nz973X3dXPvc1F3P/RGjvONSd49jOPt3X3tKtv1MJ6ruvvoG/leAPwzOQHAInIC1sE1i5ias5L8Ug2VeVU9cFj+Q0m+3N0/SPKzmTX3SfLtJAfNPf8LSY6uqltU1WFJjlvlfS5OckhVPXR4n/2r6l8tGNf13qe7r8js04lfzWxHv8ctkjx1uP+MJB/q7m8l+XxV/fTwXlVVP7LgvQBYnZwAYBE5AeugLGJqXpbZ+b8fr6qLhsdJ8jtJnjNcTO6+Sb4zLP94kuuq6sKq+pUk/zvJ55N8KslvJ/nY3t6ku6/JbCf8m8NrXpDZoZ2r+f0kvztckO5Ww7I3J7m0uz89t913khxXVZ9M8qgkpwzLn5nk3w3vdVGSJ675JwHA3sgJABaRE7AO1d3LHgPsSMO3JJzf3b83t+zq7r7tEocFwDYhJwBYRE6wTMoi2ARVdV5mrf9juvv7c8vt3AGQEwAsJCdYNmUR7IOqen2Sh69Y/NrufuPetgfg5kVOALCInGAqlEUAAAAAjFzgGgAAAICRsggAAACAkbIIAAAAgJGyCAAAAIDR/w/bdvlvWqTD1gAAAABJRU5ErkJggg==\n",
      "text/plain": [
       "<Figure size 1440x432 with 3 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
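    "import matplotlib.pyplot as plt\n",
    "\n",
    "# Plot accuracy, precision, and AUC for the three feature sets side by side\n",
    "plt.figure(figsize=(20, 6))\n",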
    "plt.subplot(1, 3, 1)\n",
    "plt.title('Accuracy')\n",
    "plt.ylim(0.95, 1)\n",
    "scores_df.set_index('feature_type').accuracy.plot(kind='bar')\n",
    "plt.subplot(1, 3, 2)\n",
    "plt.title('Precision')\n",
    "plt.ylim(0.95, 1)\n",
    "scores_df.set_index('feature_type').precision.plot(kind='bar')\n",
    "plt.subplot(1, 3, 3)\n",
    "plt.title('AUC')\n",
    "plt.ylim(0.95, 1)\n",
    "scores_df.set_index('feature_type').auc.plot(kind='bar')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
